
core: timeouts for batch jobs #27803

Open

pkazmierczak wants to merge 37 commits into main from f-timeouts-for-batch-jobs

Conversation

pkazmierczak (Contributor) commented Apr 7, 2026

This changeset introduces a max_run_duration task group configuration variable
for batch and sysbatch jobs. It's enforced in the alloc runner by a max run
hook. When the timer is up:

  • tasks are killed
  • the allocation is marked as complete, with:
  Client Status       = complete
  Client Description  = allocation exceeded max_run_duration
  • the job status is dead.

The max_run_duration timer starts in the allocation runner regardless of task
states. That is, tasks that take longer to start than the timeout get
terminated. This prevents slow-starting tasks from running longer than their deadline allows.
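
For illustration, a minimal sketch of what such an alloc-runner hook might look like. The type, field, and callback names below are assumptions made for this sketch, not the PR's actual code:

package allocrunner

import (
    "context"
    "time"
)

// maxRunDurationHook (hypothetical name) kills the allocation's tasks
// once the configured duration elapses, counting from alloc-runner
// start rather than from the moment tasks become healthy.
type maxRunDurationHook struct {
    duration time.Duration       // the group's max_run_duration
    killAll  func(reason string) // assumed callback that kills all tasks
}

func (h *maxRunDurationHook) run(ctx context.Context) {
    if h.duration <= 0 {
        return // no timeout configured for this task group
    }
    timer := time.NewTimer(h.duration)
    defer timer.Stop()
    select {
    case <-timer.C:
        // Deadline hit: kill the tasks; the alloc is then marked complete.
        h.killAll("allocation exceeded max_run_duration")
    case <-ctx.Done():
        // The allocation finished (or was stopped) before the deadline.
    }
}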

Lifecycle tasks count toward the max_run_duration timer. The runtime of
prestart and poststart tasks counts toward the overall run time of the task
group, and poststop tasks will not be started if the task was terminated for
running out of its allocated time.

Two new metrics are emitted:

  • client.allocs.max_run_duration.configured_seconds
  • client.allocs.max_run_duration.remaining_seconds
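
For reference, a hedged sketch of how gauges like these might be emitted, assuming the hashicorp/go-metrics API; the function name and its parameters are illustrative, not the PR's actual code:

package client

import (
    "time"

    metrics "github.com/hashicorp/go-metrics"
)

// emitMaxRunDurationMetrics (hypothetical) reports the configured
// timeout and the time remaining until the allocation's deadline.
func emitMaxRunDurationMetrics(configured time.Duration, deadline time.Time, labels []metrics.Label) {
    metrics.SetGaugeWithLabels(
        []string{"client", "allocs", "max_run_duration", "configured_seconds"},
        float32(configured.Seconds()), labels)
    metrics.SetGaugeWithLabels(
        []string{"client", "allocs", "max_run_duration", "remaining_seconds"},
        float32(time.Until(deadline).Seconds()), labels)
}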

Resolves #1782
Supersedes #18456
Internal ref: https://hashicorp.atlassian.net/browse/NMD-551

pkazmierczak added the theme/client, theme/batch (Issues related to batch jobs and scheduling), theme/task lifecycle, and backport/2.0.x (backport to 2.0.x release line) labels on Apr 9, 2026
pkazmierczak self-assigned this on Apr 9, 2026
Comment thread on client/allocrunner/tasklifecycle/max_run_duration.go (Outdated)
pkazmierczak (Contributor, Author)

hey @schmichael @mismithhisler, thanks a lot for the comments! Many of the things you pointed out were messy code from the many iterations of this branch. Apologies, I should've cleaned it up better. But among other things, I simplified max_run_duration.go and removed all the state store stuff that didn't belong here.

I think the main issue that remains is how to approach task-level state. In its current shape, this code always waits for tasks to start. This is good, but it causes issues with task lifecycle events, as @schmichael pointed out. It also means max_run_duration is ineffective in situations where tasks take too long to start, which I believe is one of the major reasons people wanted this feature. The use case you mentioned, @schmichael, about big artifacts that are slow to download etc., is a common pain point I think.

After doodling a bit with a more sophisticated solution, I am slowly leaning towards a more brutal one: make the timer start immediately in the alloc runner, regardless of task state. What do you think about this?

ar.taskCoordinator.TaskStateUpdated(states)

// Get the client allocation
calloc := ar.clientAlloc(states)
mismithhisler (Member) commented Apr 14, 2026

To your point about this being tough because we have to wait for tasks to start: doesn't this function do all the ugly "check to make sure all tasks are running" logic?

I'm curious if we can just pass calloc.ID and calloc.ClientStatus to your MaxRunDuration object and have it just do the "yeah, this alloc is now running, start a timer" logic.
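
For illustration, a hypothetical sketch of that suggested shape; apart from MaxRunDuration itself, every name below is invented for this sketch:

package tasklifecycle

import (
    "sync"
    "time"
)

// MaxRunDuration sketch: starts the timeout once the allocation is
// reported as running. The fields and method are assumptions, not the
// code in this PR.
type MaxRunDuration struct {
    duration time.Duration
    once     sync.Once
    expired  chan string // a consumer kills the alloc whose ID arrives here
}

// AllocUpdated takes the client alloc's ID and status, and arms the
// timer the first time the alloc is reported as running.
func (m *MaxRunDuration) AllocUpdated(allocID, clientStatus string) {
    if clientStatus != "running" { // i.e. structs.AllocClientStatusRunning
        return
    }
    m.once.Do(func() {
        time.AfterFunc(m.duration, func() { m.expired <- allocID })
    })
}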

pkazmierczak (Contributor, Author)

#27827 (which branches off of this branch) implements the simplified version of max_run_duration that disregards task states. Consider the following jobspec:

job "maxrun" {
  type = "batch"

  group "maxrun" {

    max_run_duration = "10s"

    reschedule {
      attempts  = 15
      max_delay = "10s"
      unlimited = false
    }

    task "maxrun" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["1000m"]
      }
    }

    task "maxrun2" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["2s"]
      }
    }
  }

  group "huge_docker_image" {

    max_run_duration = "2s"
    task "prometheus" {
      driver = "docker"

      config {
        image = "prom/prometheus"
      }
    }
  }
}

What happens when we run it on f-timeouts-for-batch-jobs-no-task-states is:

  • the maxrun group runs for exactly 10s: its maxrun task gets killed after 10s, and its maxrun2 task finishes successfully.
  • the huge_docker_image group doesn't even get to start its prometheus task, because downloading the image takes more than 2s.

I am becoming more and more convinced this is a good direction.

schmichael (Member)

> After doodling a bit with a more sophisticated solution, I am slowly leaning towards a more brutal one: make the timer start immediately in the alloc runner, regardless of task state. What do you think about this?

^ + your followup comment seem exciting to me. I'm EOD here but will review the other PRs tomorrow (and check out your internal demo recording).

I can't think of a reason one approach would be more surprising than the other (to try to let least astonishment make our decision for us).

pkazmierczak (Contributor, Author) commented Apr 16, 2026

> I can't think of a reason one approach would be more surprising than the other

Working on this problem I've been trying to explore "modular" solutions. In my mind, we could easily ship a "1.0" version of this that keeps things very simple: starting the timer in the allocrunner, having no regard for post-stop tasks (ok, maaaybe this could be a switch), and just giving users the option to set timeouts for their task groups. We can see how the community responds, and later offer a more fine-grained set of knobs with a task-level max_run_duration, the exact behavior of which we can decide on later, while keeping the tg-level setting more "coarse-grained", if that makes sense.

I think it's the interaction between the tg-level and task-level setting that's the hardest part to get right in this feature.

schmichael (Member)

The CLI and UI should display the deadline in alloc status and maybe job status. We can always create followup issues for that though if we don't want to clutter this PR.

When updating unified-web-docs we should also make sure to link from periodic to this so that users know they have the option of ensuring a previous run is killed before a new run would be scheduled.

pkazmierczak (Contributor, Author)

> We can always create followup issues for that though if we don't want to clutter this PR.

yeah, I'd rather these be separate PRs if you don't mind? I need more commits on main, too. And of course I'll follow up with docs.



Development

Successfully merging this pull request may close these issues.

[feature] Timeout for batch jobs
